source('../../workflow/resources/annotateVariants.R')
sampleName <- 'Br38'
inputFolder <- '/cluster/work/bewi/members/jgawron/projects/CTC/input_folder'
annotations <- annotate_variants(sampleName, inputFolder)
For each cluster (defined by color), we computed a pairwise distance for each mutation pair that indicates how often the two mutations occur in the same private branch of cells from the cluster:
dist(M1, M2) = 0 (for M1 = M2)
dist(M1,M2) = 1 - (%samples where M1 and M2 are both in the same private branch of a cell from the cluster) (elsewise)
A private branch is defined as the path from a leaf to the node just below the LCA of this leaf to another leaf from the same cluster.
This is a generalization of the earlier method to find the top seperating mutations of pairs of leafs. The generalization was necessary to handle the larger clusters that were broken in more than 2 pieces.
clusterName <- "lightcoral"
d <- read.table(file.path(inputFolder, sampleName, paste0(sampleName, "_postSampling_", clusterName, ".txt")), header = TRUE, sep = "\t", stringsAsFactors = F, row.names = 1)
mat <- as.matrix(d)
mat[1:4, 1:4]
## chr14_50915926 chr9_130362124 chr12_39332481 chr8_104251027
## chr14_50915926 0.00000 0.800650 0.880900 0.942950
## chr9_130362124 0.80065 0.000000 0.828475 0.921700
## chr12_39332481 0.88090 0.828475 0.000000 0.952425
## chr8_104251027 0.94295 0.921700 0.952425 0.000000
For each position, we computed the percentage of samples that have a coverage of at least 3 at this position. This is meant as a simple score of the data quality of a position that can be used in addition to the separation score to pick mutations for the wet lab experiments. Furthermore, we added simple functional annotations to the variants.
coverage <- read.table(file.path(inputFolder, sampleName, paste(sampleName, "covScore.txt", sep = "_")), header = TRUE, sep = "\t", stringsAsFactors = F, row.names = 1)
coverage$variantName <- rownames(coverage)
head(coverage)
## covScore variantName
## chr14_50915926 0.1818182 chr14_50915926
## chr9_130362124 0.1818182 chr9_130362124
## chr12_39332481 0.2727273 chr12_39332481
## chr8_104251027 0.2727273 chr8_104251027
## chr9_132231400 0.1818182 chr9_132231400
## chr11_69703275 0.1818182 chr11_69703275
coverage <- inner_join(coverage, annotations, by = "variantName")
###Overview To get an overview, we plot the full distance matrix:
library(heatmaply)
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: viridis
## Loading required package: viridisLite
##
## ======================
## Welcome to heatmaply version 1.5.0
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags:
## https://stackoverflow.com/questions/tagged/heatmaply
## ======================
heatmaply(mat)
mat2 <- mat
diag(mat2) <- 1
min_dist <- apply(mat2, 1, min) # find minimum distance to other mutations
selected_muts <- which(min_dist < 0.7) # select those below 0.5 say
mat3 <- mat[selected_muts, selected_muts]
This is what the distance matrix looks like now:
heatmaply(mat3)
To cluster mutations, we create a dendrogram based on the pairwise distances:
mat <- mat3
d_mat <- as.dist(mat)
hc <- hclust(d_mat, "average") ## hierarchical clustering of mutations based on distance matrix
par(cex = 0.6)
plot(hc, main = "Dendrogram based on average pairwise distance", sub = "", xlab = "Separating mutations")
No apparent clustering visible.